sessionInfo()Biostat 203B Homework 3
Due Feb 23 @ 11:59PM
Display machine information for reproducibility:
Load necessary libraries (you can add more as needed).
library(arrow)
Attaching package: 'arrow'
The following object is masked from 'package:utils':
timestamp
library(gtsummary)
library(memuse)
library(pryr)
library(R.utils)Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.26.0 (2024-01-24 05:12:50 UTC) successfully loaded. See ?R.oo for help.
Attaching package: 'R.oo'
The following object is masked from 'package:R.methodsS3':
throw
The following objects are masked from 'package:methods':
getClasses, getMethods
The following objects are masked from 'package:base':
attach, detach, load, save
R.utils v2.12.3 (2023-11-18 01:00:02 UTC) successfully loaded. See ?R.utils for help.
Attaching package: 'R.utils'
The following object is masked from 'package:arrow':
timestamp
The following object is masked from 'package:utils':
timestamp
The following objects are masked from 'package:base':
cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
library(tidyverse)── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::compose() masks pryr::compose()
✖ lubridate::duration() masks arrow::duration()
✖ tidyr::extract() masks R.utils::extract()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ purrr::partial() masks pryr::partial()
✖ dplyr::where() masks pryr::where()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(GGally)Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
Attaching package: 'GGally'
The following object is masked from 'package:R.utils':
wrap
Display your machine memory.
memuse::Sys.meminfo()In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.
Q1. Visualizing patient trajectory
Visualizing a patient’s encounters in a health care system is a common task in clinical data analysis. In this question, we will visualize a patient’s ADT (admission-discharge-transfer) history and ICU vitals in the MIMIC-IV data.
Q1.1 ADT history
A patient’s ADT history records the time of admission, discharge, and transfer in the hospital. This figure shows the ADT history of the patient with subject_id 10001217 in the MIMIC-IV data. The x-axis is the calendar time, and the y-axis is the type of event (ADT, lab, procedure). The color of the line segment represents the care unit. The size of the line segment represents whether the care unit is an ICU/CCU. The crosses represent lab events, and the shape of the dots represents the type of procedure. The title of the figure shows the patient’s demographic information and the subtitle shows top 3 diagnoses.
Do a similar visualization for the patient with
subject_id 10013310 using ggplot.
Hint: We need to pull information from data files patients.csv.gz, admissions.csv.gz, transfers.csv.gz, labevents.csv.gz, procedures_icd.csv.gz, diagnoses_icd.csv.gz, d_icd_procedures.csv.gz, and d_icd_diagnoses.csv.gz. For the big file labevents.csv.gz, use the Parquet format you generated in Homework 2. For reproducibility, make the Parquet folder labevents_pq available at the current working directory hw3, for example, by a symbolic link. Make your code reproducible.
Answer
patients_tble <- read_csv("~/mimic/hosp/patients.csv.gz")Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): gender, anchor_year_group
dbl (3): subject_id, anchor_age, anchor_year
date (1): dod
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
patients_tble# A tibble: 299,712 × 6
subject_id gender anchor_age anchor_year anchor_year_group dod
<dbl> <chr> <dbl> <dbl> <chr> <date>
1 10000032 F 52 2180 2014 - 2016 2180-09-09
2 10000048 F 23 2126 2008 - 2010 NA
3 10000068 F 19 2160 2008 - 2010 NA
4 10000084 M 72 2160 2017 - 2019 2161-02-13
5 10000102 F 27 2136 2008 - 2010 NA
6 10000108 M 25 2163 2014 - 2016 NA
7 10000115 M 24 2154 2017 - 2019 NA
8 10000117 F 48 2174 2008 - 2010 NA
9 10000178 F 59 2157 2017 - 2019 NA
10 10000248 M 34 2192 2014 - 2016 NA
# ℹ 299,702 more rows
admissions_tble <- read_csv("~/mimic/hosp/admissions.csv.gz")Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
admissions_tble# A tibble: 431,231 × 16
subject_id hadm_id admittime dischtime
<dbl> <dbl> <dttm> <dttm>
1 10000032 22595853 2180-05-06 22:23:00 2180-05-07 17:15:00
2 10000032 22841357 2180-06-26 18:27:00 2180-06-27 18:49:00
3 10000032 25742920 2180-08-05 23:44:00 2180-08-07 17:50:00
4 10000032 29079034 2180-07-23 12:35:00 2180-07-25 17:55:00
5 10000068 25022803 2160-03-03 23:16:00 2160-03-04 06:26:00
6 10000084 23052089 2160-11-21 01:56:00 2160-11-25 14:52:00
7 10000084 29888819 2160-12-28 05:11:00 2160-12-28 16:07:00
8 10000108 27250926 2163-09-27 23:17:00 2163-09-28 09:04:00
9 10000117 22927623 2181-11-15 02:05:00 2181-11-15 14:52:00
10 10000117 27988844 2183-09-18 18:10:00 2183-09-21 16:30:00
# ℹ 431,221 more rows
# ℹ 12 more variables: deathtime <dttm>, admission_type <chr>,
# admit_provider_id <chr>, admission_location <chr>,
# discharge_location <chr>, insurance <chr>, language <chr>,
# marital_status <chr>, race <chr>, edregtime <dttm>, edouttime <dttm>,
# hospital_expire_flag <dbl>
transfers_tble <- read_csv("~/mimic/hosp/transfers.csv.gz")Rows: 1890972 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): eventtype, careunit
dbl (3): subject_id, hadm_id, transfer_id
dttm (2): intime, outtime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
transfers_tble# A tibble: 1,890,972 × 7
subject_id hadm_id transfer_id eventtype careunit intime
<dbl> <dbl> <dbl> <chr> <chr> <dttm>
1 10000032 22595853 33258284 ED Emergency Depa… 2180-05-06 19:17:00
2 10000032 22595853 35223874 admit Transplant 2180-05-06 23:30:00
3 10000032 22595853 36904543 discharge <NA> 2180-05-07 17:21:27
4 10000032 22841357 34100253 discharge <NA> 2180-06-27 18:49:12
5 10000032 22841357 34703856 admit Transplant 2180-06-26 21:31:00
6 10000032 22841357 38112554 ED Emergency Depa… 2180-06-26 15:54:00
7 10000032 25742920 35509340 admit Transplant 2180-08-06 01:44:00
8 10000032 25742920 35968195 ED Emergency Depa… 2180-08-05 20:58:00
9 10000032 25742920 38883756 discharge <NA> 2180-08-07 17:50:44
10 10000032 29079034 32952584 ED Emergency Depa… 2180-07-22 16:24:00
# ℹ 1,890,962 more rows
# ℹ 1 more variable: outtime <dttm>
procedures_icd_tble <- read_csv("~/mimic/hosp/procedures_icd.csv.gz")Rows: 669186 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): icd_code
dbl (4): subject_id, hadm_id, seq_num, icd_version
date (1): chartdate
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
procedures_icd_tble# A tibble: 669,186 × 6
subject_id hadm_id seq_num chartdate icd_code icd_version
<dbl> <dbl> <dbl> <date> <chr> <dbl>
1 10000032 22595853 1 2180-05-07 5491 9
2 10000032 22841357 1 2180-06-27 5491 9
3 10000032 25742920 1 2180-08-06 5491 9
4 10000068 25022803 1 2160-03-03 8938 9
5 10000117 27988844 1 2183-09-19 0QS734Z 10
6 10000280 25852320 1 2151-03-18 8938 9
7 10000560 28979390 1 2189-10-16 5551 9
8 10000635 26134563 1 2136-06-19 3734 9
9 10000635 26134563 2 2136-06-19 3728 9
10 10000635 26134563 3 2136-06-19 3727 9
# ℹ 669,176 more rows
#diagnoses_icd_tble <- read_csv("~/mimic/hosp/diagnoses_icd.csv.gz")
#diagnoses_icd_tble
d_icd_procedures_tble <- read_csv("~/mimic/hosp/d_icd_procedures.csv.gz")Rows: 85257 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d_icd_procedures_tble# A tibble: 85,257 × 3
icd_code icd_version long_title
<chr> <dbl> <chr>
1 0001 9 Therapeutic ultrasound of vessels of head and neck
2 0002 9 Therapeutic ultrasound of heart
3 0003 9 Therapeutic ultrasound of peripheral vascular vessels
4 0009 9 Other therapeutic ultrasound
5 001 10 Central Nervous System and Cranial Nerves, Bypass
6 0010 9 Implantation of chemotherapeutic agent
7 0011 9 Infusion of drotrecogin alfa (activated)
8 0012 9 Administration of inhaled nitric oxide
9 0013 9 Injection or infusion of nesiritide
10 0014 9 Injection or infusion of oxazolidinone class of antibio…
# ℹ 85,247 more rows
d_icd_diagnoses_tble <- read_csv("~/mimic/hosp/d_icd_diagnoses.csv.gz")Rows: 109775 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d_icd_diagnoses_tble# A tibble: 109,775 × 3
icd_code icd_version long_title
<chr> <dbl> <chr>
1 0010 9 Cholera due to vibrio cholerae
2 0011 9 Cholera due to vibrio cholerae el tor
3 0019 9 Cholera, unspecified
4 0020 9 Typhoid fever
5 0021 9 Paratyphoid fever A
6 0022 9 Paratyphoid fever B
7 0023 9 Paratyphoid fever C
8 0029 9 Paratyphoid fever, unspecified
9 0030 9 Salmonella gastroenteritis
10 0031 9 Salmonella septicemia
# ℹ 109,765 more rows
Import transfer.csv.gz as a tibble sid_adt:
sid <- 10013310
sid_adt <- read_csv("~/mimic/hosp/transfers.csv.gz") |>
filter(subject_id == sid) |>
print(width = Inf)Rows: 1890972 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): eventtype, careunit
dbl (3): subject_id, hadm_id, transfer_id
dttm (2): intime, outtime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 14 × 7
subject_id hadm_id transfer_id eventtype
<dbl> <dbl> <dbl> <chr>
1 10013310 21243435 31696219 discharge
2 10013310 21243435 31736720 ED
3 10013310 21243435 33511674 transfer
4 10013310 21243435 34848129 transfer
5 10013310 21243435 38910974 admit
6 10013310 22098926 31651850 transfer
7 10013310 22098926 32769810 admit
8 10013310 22098926 33278851 transfer
9 10013310 22098926 34063502 ED
10 10013310 22098926 36029206 discharge
11 10013310 27682188 30077870 transfer
12 10013310 27682188 30444898 discharge
13 10013310 27682188 31203589 admit
14 10013310 27682188 35160955 ED
careunit intime
<chr> <dttm>
1 <NA> 2153-06-05 19:58:00
2 Emergency Department 2153-05-26 08:56:00
3 Medicine/Cardiology 2153-05-26 16:19:26
4 Medicine/Cardiology 2153-05-26 14:42:55
5 Medicine/Cardiology 2153-05-26 14:18:39
6 Neuro Intermediate 2153-06-12 16:31:33
7 Neuro Surgical Intensive Care Unit (Neuro SICU) 2153-06-10 11:55:42
8 Medicine 2153-06-16 19:03:14
9 Emergency Department 2153-06-10 10:40:00
10 <NA> 2153-07-21 18:02:28
11 Medicine/Cardiology 2153-05-07 20:47:19
12 <NA> 2153-05-13 15:36:52
13 Coronary Care Unit (CCU) 2153-05-06 18:28:00
14 Emergency Department 2153-05-06 10:21:00
outtime
<dttm>
1 NA
2 2153-05-26 14:18:39
3 2153-06-05 19:58:00
4 2153-05-26 16:19:26
5 2153-05-26 14:42:55
6 2153-06-16 19:03:14
7 2153-06-12 16:31:33
8 2153-07-21 18:02:28
9 2153-06-10 11:55:42
10 NA
11 2153-05-13 15:36:52
12 NA
13 2153-05-07 20:47:19
14 2153-05-06 18:28:00
Store patient’s demographic information as age, sex, and race:
age <- patients_tble %>%
filter(subject_id == sid) %>%
pull(anchor_age)
sex <- patients_tble %>%
filter(subject_id == sid) %>%
pull(gender)
race <- admissions_tble %>%
filter(subject_id == sid) %>%
distinct(race) %>%
pull()Store the top 3 diagnoses as top3_diag:
top3_diag_icd <- read_csv("~/mimic/hosp/diagnoses_icd.csv.gz") %>%
filter(subject_id == sid) %>%
slice_head(n = 3) %>%
pull(icd_code)Rows: 4756326 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): icd_code
dbl (4): subject_id, hadm_id, seq_num, icd_version
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
top3_diag_icd[1] "I222" "I5023" "I428"
top3_diag <- d_icd_diagnoses_tble %>%
filter(icd_code %in% top3_diag_icd) %>%
pull(long_title)
top3_diag[1] "Subsequent non-ST elevation (NSTEMI) myocardial infarction"
[2] "Other cardiomyopathies"
[3] "Acute on chronic systolic (congestive) heart failure"
Subset labevents_tble to sid_lab:
sid_lab <- arrow::open_dataset("labevents_pq", format = "parquet") %>%
filter(subject_id == sid) %>%
select(charttime) %>%
collect()
sid_lab# A tibble: 2,285 × 1
charttime
<dttm>
1 2153-05-06 03:30:00
2 2153-05-06 03:30:00
3 2153-05-06 03:30:00
4 2153-05-06 03:30:00
5 2153-05-06 03:30:00
6 2153-05-06 03:30:00
7 2153-05-06 03:30:00
8 2153-05-06 03:30:00
9 2153-05-06 03:30:00
10 2153-05-06 03:30:00
# ℹ 2,275 more rows
Subset procedures_icd_tble:
sid_procedure <- procedures_icd_tble %>%
filter(subject_id == sid) %>%
select(chartdate, icd_code)
sid_procedure <- left_join(sid_procedure, d_icd_procedures_tble, by = "icd_code") %>%
select(chartdate, long_title)
sid_procedure$chartdate <- sid_procedure$chartdate %>% as_datetime()
sid_procedure# A tibble: 9 × 2
chartdate long_title
<dttm> <chr>
1 2153-05-27 00:00:00 Measurement of Cardiac Sampling and Pressure, Left Heart,…
2 2153-05-27 00:00:00 Fluoroscopy of Multiple Coronary Arteries using Low Osmol…
3 2153-05-27 00:00:00 Ultrasonography of Multiple Coronary Arteries, Intravascu…
4 2153-06-10 00:00:00 Extirpation of Matter from Intracranial Artery, Percutane…
5 2153-06-10 00:00:00 Introduction of Other Thrombolytic into Peripheral Artery…
6 2153-07-15 00:00:00 Insertion of Feeding Device into Stomach, Percutaneous Ap…
7 2153-06-11 00:00:00 Introduction of Nutritional Substance into Upper GI, Via …
8 2153-05-06 00:00:00 Dilation of Coronary Artery, One Artery with Drug-eluting…
9 2153-05-06 00:00:00 Fluoroscopy of Multiple Coronary Arteries using Other Con…
Generate the final plot:
ggplot() +
# ADT
geom_segment(data = sid_adt %>%
filter(eventtype != "discharge"),
aes(x = intime,
xend = outtime,
y = "ADT",
yend = "ADT",
color = careunit,
linewidth = str_detect(careunit, "(ICU|CCU)"))) +
# Lab
geom_point(data = sid_lab,
aes(x = charttime,
y = "Lab"),
shape = 3,
size = 3) +
# Procedure
geom_point(data = sid_procedure,
aes(x = chartdate,
y = "Procedure",
shape = long_title,
size = 3)) +
scale_shape_manual(values = c(2:10)) +
#guides(shape = guide_legend(title = "Procedure")) +
guides(linewidth = "none",
size = "none",
color = guide_legend(title = "Care Unit"),
shape = guide_legend(title = "Procedure", nrow = 5)) +
labs(
x = "Calendar Time",
y = "",
title = str_c("Patient ", sid, ", ", sex, ", ", age, " years old, ", race),
subtitle = str_c(top3_diag[1],"\n", top3_diag[2],"\n", top3_diag[3])
) +
theme_light() +
theme(legend.position = "bottom",
legend.box = "vertical") +
scale_y_discrete(limits = c("Procedure", "Lab", "ADT"))Warning: Using linewidth for a discrete variable is not advised.
# Free RAM usage for subsequent rendering
rm(transfers_tble,
procedures_icd_tble,
d_icd_procedures_tble,
d_icd_diagnoses_tble)Q1.2 ICU stays
ICU stays are a subset of ADT history. This figure shows the vitals of the patient 10001217 during ICU stays. The x-axis is the calendar time, and the y-axis is the value of the vital. The color of the line represents the type of vital. The facet grid shows the abbreviation of the vital and the stay ID.
Do a similar visualization for the patient 10013310.
Answer
Create a .parquet file for chartevents.csv:
# I chose arrow::open_csv_dataset() for HW2 Q3, so I have to convert it into
# a .parqeut file in this HW.
arrow::open_csv_dataset("/home/zhili/203b-hw/hw2/chartevents.csv") %>%
arrow::write_dataset(format = "parquet",
path = "/home/zhili/203b-hw/hw2/chartevents_pq")Read in the .parquet file:
chartevents_tble <- arrow::open_dataset("chartevents_pq",
format = "parquet") %>%
filter(subject_id == sid) %>%
filter(itemid %in% c(220045, 220180, 220179, 223761, 220210)) %>%
collect()
chartevents_tble# A tibble: 549 × 11
subject_id hadm_id stay_id caregiver_id charttime
<int> <int> <int> <int> <dttm>
1 10013310 22098926 32769810 59913 2153-06-12 12:01:00
2 10013310 22098926 32769810 59913 2153-06-12 12:01:00
3 10013310 22098926 32769810 59913 2153-06-12 12:50:00
4 10013310 22098926 32769810 59913 2153-06-12 12:50:00
5 10013310 22098926 32769810 59913 2153-06-12 13:00:00
6 10013310 22098926 32769810 59913 2153-06-12 13:00:00
7 10013310 22098926 32769810 59913 2153-06-12 13:00:00
8 10013310 22098926 32769810 59913 2153-06-12 13:03:00
9 10013310 22098926 32769810 59913 2153-06-12 13:03:00
10 10013310 22098926 32769810 59913 2153-06-12 14:00:00
# ℹ 539 more rows
# ℹ 6 more variables: storetime <dttm>, itemid <int>, value <chr>,
# valuenum <dbl>, valueuom <chr>, warning <int>
chartevents_tble$itemid <- as.factor(chartevents_tble$itemid)Generate the plot:
# labeling the itemid
vital_labels <- c(`220045` = "HR",
`220180` = "NBPd",
`220179` = "NBPs",
`220210` = "RR",
`223761` = "Temperature")
chartevents_tble$itemid <- factor(chartevents_tble$itemid, levels = names(vital_labels))ggplot(chartevents_tble, aes(x = charttime, y = valuenum)) +
geom_line(aes(color = itemid)) +
geom_point(aes(color = itemid)) +
facet_grid(itemid ~ stay_id,
scales = "free",
labeller = labeller(itemid = vital_labels)) +
labs(
x = "Calendar Time",
y = "Vital",
title = str_c("Patient ", sid, " ICU stays - Vitals")
) +
theme_light() +
guides(color = "none")Q2. ICU stays
icustays.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/icustays/) contains data about Intensive Care Units (ICU) stays. The first 10 lines are
zcat < ~/mimic/icu/icustays.csv.gz | headQ2.1 Ingestion
Import icustays.csv.gz as a tibble icustays_tble.
Answer:
icustays_tble <- read_csv("~/mimic/icu/icustays.csv.gz")Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): first_careunit, last_careunit
dbl (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
icustays_tble# A tibble: 73,181 × 8
subject_id hadm_id stay_id first_careunit last_careunit intime
<dbl> <dbl> <dbl> <chr> <chr> <dttm>
1 10000032 29079034 39553978 Medical Inten… Medical Inte… 2180-07-23 14:00:00
2 10000980 26913865 39765666 Medical Inten… Medical Inte… 2189-06-27 08:42:00
3 10001217 24597018 37067082 Surgical Inte… Surgical Int… 2157-11-20 19:18:02
4 10001217 27703517 34592300 Surgical Inte… Surgical Int… 2157-12-19 15:42:24
5 10001725 25563031 31205490 Medical/Surgi… Medical/Surg… 2110-04-11 15:52:22
6 10001884 26184834 37510196 Medical Inten… Medical Inte… 2131-01-11 04:20:05
7 10002013 23581541 39060235 Cardiac Vascu… Cardiac Vasc… 2160-05-18 10:00:53
8 10002155 20345487 32358465 Medical Inten… Medical Inte… 2131-03-09 21:33:00
9 10002155 23822395 33685454 Coronary Care… Coronary Car… 2129-08-04 12:45:00
10 10002155 28994087 31090461 Medical/Surgi… Medical/Surg… 2130-09-24 00:50:00
# ℹ 73,171 more rows
# ℹ 2 more variables: outtime <dttm>, los <dbl>
Q2.2 Summary and visualization
How many unique subject_id? Can a subject_id have multiple ICU stays? Summarize the number of ICU stays per subject_id by graphs.
Answer:
There are 50,920 unique subject_ids. A subject_id can have multiple ICU stays as shown below. The patient with subject_id 18358138 has 37 ICU stays, the most among all patients.
icustays_tble %>% distinct(subject_id)# A tibble: 50,920 × 1
subject_id
<dbl>
1 10000032
2 10000980
3 10001217
4 10001725
5 10001884
6 10002013
7 10002155
8 10002348
9 10002428
10 10002430
# ℹ 50,910 more rows
icustay_count <- icustays_tble %>%
count(subject_id) %>%
arrange(desc(n)) %>%
print(width = Inf)# A tibble: 50,920 × 2
subject_id n
<dbl> <int>
1 18358138 37
2 12468016 33
3 17585185 33
4 13269859 30
5 18676703 26
6 11281568 25
7 15455517 25
8 15573773 24
9 17937834 24
10 15131736 22
# ℹ 50,910 more rows
icustay_count %>%
ggplot(aes(x = n)) +
geom_bar(color = "darkblue",
fill = "lightskyblue") +
labs(
x = "Number of ICU stays",
y = "Number of patients",
title = "Number of ICU stays per patient"
) +
theme_light()icustay_count %>%
filter(n > 5) %>%
ggplot(aes(x = n)) +
geom_bar(color = "darkblue",
fill = "lightskyblue") +
labs(
x = "Number of ICU stays",
y = "Number of patients",
title = "Number of ICU stays per patient (stays >= 6)"
) +
scale_x_continuous(breaks = seq(6, 37, 1)) +
theme_light()Q3. admissions data
Information of the patients admitted into hospital is available in admissions.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/admissions/ for details of each field in this file. The first 10 lines are
zcat < ~/mimic/hosp/admissions.csv.gz | headQ3.1 Ingestion
Import admissions.csv.gz as a tibble admissions_tble.
Answer:
admissions_tble <- read_csv("~/mimic/hosp/admissions.csv.gz")Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
admissions_tble# A tibble: 431,231 × 16
subject_id hadm_id admittime dischtime
<dbl> <dbl> <dttm> <dttm>
1 10000032 22595853 2180-05-06 22:23:00 2180-05-07 17:15:00
2 10000032 22841357 2180-06-26 18:27:00 2180-06-27 18:49:00
3 10000032 25742920 2180-08-05 23:44:00 2180-08-07 17:50:00
4 10000032 29079034 2180-07-23 12:35:00 2180-07-25 17:55:00
5 10000068 25022803 2160-03-03 23:16:00 2160-03-04 06:26:00
6 10000084 23052089 2160-11-21 01:56:00 2160-11-25 14:52:00
7 10000084 29888819 2160-12-28 05:11:00 2160-12-28 16:07:00
8 10000108 27250926 2163-09-27 23:17:00 2163-09-28 09:04:00
9 10000117 22927623 2181-11-15 02:05:00 2181-11-15 14:52:00
10 10000117 27988844 2183-09-18 18:10:00 2183-09-21 16:30:00
# ℹ 431,221 more rows
# ℹ 12 more variables: deathtime <dttm>, admission_type <chr>,
# admit_provider_id <chr>, admission_location <chr>,
# discharge_location <chr>, insurance <chr>, language <chr>,
# marital_status <chr>, race <chr>, edregtime <dttm>, edouttime <dttm>,
# hospital_expire_flag <dbl>
Q3.2 Summary and visualization
Summarize the following information by graphics and explain any patterns you see.
- number of admissions per patient
- admission hour (anything unusual?)
- admission minute (anything unusual?)
- length of hospital stay (from admission to discharge) (anything unusual?)
According to the MIMIC-IV documentation,
All dates in the database have been shifted to protect patient confidentiality. Dates will be internally consistent for the same patient, but randomly distributed in the future. Dates of birth which occur in the present time are not true dates of birth. Furthermore, dates of birth which occur before the year 1900 occur if the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300.
Answer:
Transform the data and summarize the information:
adm_count <- admissions_tble %>%
count(subject_id)
adm_count# A tibble: 180,733 × 2
subject_id n
<dbl> <int>
1 10000032 4
2 10000068 1
3 10000084 2
4 10000108 1
5 10000117 2
6 10000248 1
7 10000280 1
8 10000560 1
9 10000635 1
10 10000719 1
# ℹ 180,723 more rows
adm_hour <- admissions_tble %>%
select(admittime) %>%
pull() %>%
format("%H") %>%
as_tibble()
adm_hour# A tibble: 431,231 × 1
value
<chr>
1 22
2 18
3 23
4 12
5 23
6 01
7 05
8 23
9 02
10 18
# ℹ 431,221 more rows
adm_minute <- admissions_tble %>%
select(admittime) %>%
pull() %>%
format("%M") %>%
as_tibble()
adm_minute# A tibble: 431,231 × 1
value
<chr>
1 23
2 27
3 44
4 35
5 16
6 56
7 11
8 17
9 05
10 10
# ℹ 431,221 more rows
adm_los <- admissions_tble %>%
mutate(los = as.numeric(dischtime - admittime, units = "hours")) %>%
select(los)
adm_los# A tibble: 431,231 × 1
los
<dbl>
1 18.9
2 24.4
3 42.1
4 53.3
5 7.17
6 109.
7 10.9
8 9.78
9 12.8
10 70.3
# ℹ 431,221 more rows
Plot the transformed data:
Number of admissions per patient
adm_count %>%
ggplot(aes(x = n)) +
geom_histogram(binwidth = 1,
color = "darkblue",
fill = "lightskyblue") +
labs(
x = "Number of admissions",
y = "Number of patients",
title = "Number of admissions per patient"
) +
geom_vline(xintercept = 10, color = "red") +
annotate("text", x = 30, y = -5000, label = "10 admissions", color = "red") +
theme_light()Admission hour
adm_hour %>%
ggplot(aes(x = value)) +
geom_bar(fill = "lightskyblue",
color = "darkblue") +
labs(
x = "Admission hour",
y = "Number of admissions",
title = "Admission hour"
) +
theme_light()Admission minute
top4admmin <- adm_minute %>%
count(value) %>%
arrange(desc(n)) %>%
slice_head(n = 4)
adm_minute %>%
ggplot(aes(x = as.numeric(value))) +
geom_histogram(binwidth = 1,
color = "darkblue",
fill = "lightskyblue") +
labs(
x = "Admission minute",
y = "Number of admissions",
title = "Admission minute"
) +
annotate("text",
x = as.numeric(top4admmin[1,1]),
y = as.numeric(top4admmin[1,2]) + 1000 ,
label = str_c(top4admmin[1,1], ": ", top4admmin[1,2]),
color = "red") +
annotate("text",
x = as.numeric(top4admmin[2,1]),
y = as.numeric(top4admmin[2,2]) + 1000 ,
label = str_c(top4admmin[2,1], ": ", top4admmin[2,2]),
color = "red") +
annotate("text",
x = as.numeric(top4admmin[3,1]),
y = as.numeric(top4admmin[3,2]) + 1000 ,
label = str_c(top4admmin[3,1], ": ", top4admmin[3,2]),
color = "red") +
annotate("text",
x = as.numeric(top4admmin[4,1]),
y = as.numeric(top4admmin[4,2]) + 1000 ,
label = str_c(top4admmin[4,1], ": ", top4admmin[4,2]),
color = "red") +
theme_light()Length of hospital stay
adm_los %>%
ggplot(aes(x = los)) +
geom_histogram(binwidth = 24,
color = "darkblue",
fill = "lightskyblue") +
labs(
x = "Length of hospital stay (hours)",
y = "Number of admissions",
title = "Length of hospital stay"
) +
theme_light()adm_los %>%
filter(los < 1000) %>%
ggplot(aes(x = los)) +
geom_histogram(binwidth = 24,
color = "darkblue",
fill = "lightskyblue") +
labs(
x = "Length of hospital stay (hours)",
y = "Number of admissions",
title = "Length of hospital stay (< 1,000 hours)"
) +
theme_light()- The number of admissions per patient is right-skewed, with most patients having 1 or 2 admissions. There are some outliers distributed evenly from more than 25 admissions, among which we can see a patient that has 238 admissions which is far more than the rest.
adm_count %>% arrange(desc(n))# A tibble: 180,733 × 2
subject_id n
<dbl> <int>
1 15496609 238
2 15464144 185
3 10714009 163
4 16662316 142
5 15229574 130
6 14394983 129
7 11553072 98
8 13475033 97
9 17517983 95
10 17011846 94
# ℹ 180,723 more rows
The admission hour is not showing any unusual pattern. The most admissions occur at 12 a.m. at the midnight and generally there are more admissions from 14 p.m. to 12 a.m. than the rest of the day. Also we can see that there are more admissions at 7 a.m. than the nearby hours.
The admission minute is almost distributed evenly, except for the top 4 minutes which are
00,15,30, and45, with frequency of26,319,22,243,12,673, and9,239, respectively.The length of hospital stay (los) is right-skewed, with most admissions having a length of stay less than 100 hours.
Q4. patients data
Patient information is available in patients.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/patients/ for details of each field in this file. The first 10 lines are
zcat < ~/mimic/hosp/patients.csv.gz | headQ4.1 Ingestion
Import patients.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/patients/) as a tibble patients_tble.
Answer:
patients_tble <- read_csv("~/mimic/hosp/patients.csv.gz")Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): gender, anchor_year_group
dbl (3): subject_id, anchor_age, anchor_year
date (1): dod
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
patients_tble# A tibble: 299,712 × 6
subject_id gender anchor_age anchor_year anchor_year_group dod
<dbl> <chr> <dbl> <dbl> <chr> <date>
1 10000032 F 52 2180 2014 - 2016 2180-09-09
2 10000048 F 23 2126 2008 - 2010 NA
3 10000068 F 19 2160 2008 - 2010 NA
4 10000084 M 72 2160 2017 - 2019 2161-02-13
5 10000102 F 27 2136 2008 - 2010 NA
6 10000108 M 25 2163 2014 - 2016 NA
7 10000115 M 24 2154 2017 - 2019 NA
8 10000117 F 48 2174 2008 - 2010 NA
9 10000178 F 59 2157 2017 - 2019 NA
10 10000248 M 34 2192 2014 - 2016 NA
# ℹ 299,702 more rows
Q4.2 Summary and visualization
Summarize variables gender and anchor_age by graphics, and explain any patterns you see.
Answer:
Plot gender:
patients_tble %>% ggplot() +
geom_bar(aes(x = gender),
color = "darkblue",
fill = "lightskyblue") +
labs(
x = "Gender",
y = "Count",
title = "Gender count"
) +
scale_x_discrete(labels = c("Female", "Male")) theme_light()List of 97
$ line :List of 6
..$ colour : chr "black"
..$ linewidth : num 0.5
..$ linetype : num 1
..$ lineend : chr "butt"
..$ arrow : logi FALSE
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_line" "element"
$ rect :List of 5
..$ fill : chr "white"
..$ colour : chr "black"
..$ linewidth : num 0.5
..$ linetype : num 1
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_rect" "element"
$ text :List of 11
..$ family : chr ""
..$ face : chr "plain"
..$ colour : chr "black"
..$ size : num 11
..$ hjust : num 0.5
..$ vjust : num 0.5
..$ angle : num 0
..$ lineheight : num 0.9
..$ margin : 'margin' num [1:4] 0points 0points 0points 0points
.. ..- attr(*, "unit")= int 8
..$ debug : logi FALSE
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ title : NULL
$ aspect.ratio : NULL
$ axis.title : NULL
$ axis.title.x :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : NULL
..$ vjust : num 1
..$ angle : NULL
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 2.75points 0points 0points 0points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ axis.title.x.top :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : NULL
..$ vjust : num 0
..$ angle : NULL
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 0points 0points 2.75points 0points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ axis.title.x.bottom : NULL
$ axis.title.y :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : NULL
..$ vjust : num 1
..$ angle : num 90
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 0points 2.75points 0points 0points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ axis.title.y.left : NULL
$ axis.title.y.right :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : NULL
..$ vjust : num 0
..$ angle : num -90
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 0points 0points 0points 2.75points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ axis.text :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : chr "grey30"
..$ size : 'rel' num 0.8
..$ hjust : NULL
..$ vjust : NULL
..$ angle : NULL
..$ lineheight : NULL
..$ margin : NULL
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ axis.text.x :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : NULL
..$ vjust : num 1
..$ angle : NULL
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 2.2points 0points 0points 0points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ axis.text.x.top :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : NULL
..$ vjust : num 0
..$ angle : NULL
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 0points 0points 2.2points 0points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ axis.text.x.bottom : NULL
$ axis.text.y :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : num 1
..$ vjust : NULL
..$ angle : NULL
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 0points 2.2points 0points 0points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ axis.text.y.left : NULL
$ axis.text.y.right :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : num 0
..$ vjust : NULL
..$ angle : NULL
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 0points 0points 0points 2.2points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ axis.ticks :List of 6
..$ colour : chr "grey70"
..$ linewidth : 'rel' num 0.5
..$ linetype : NULL
..$ lineend : NULL
..$ arrow : logi FALSE
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_line" "element"
$ axis.ticks.x : NULL
$ axis.ticks.x.top : NULL
$ axis.ticks.x.bottom : NULL
$ axis.ticks.y : NULL
$ axis.ticks.y.left : NULL
$ axis.ticks.y.right : NULL
$ axis.ticks.length : 'simpleUnit' num 2.75points
..- attr(*, "unit")= int 8
$ axis.ticks.length.x : NULL
$ axis.ticks.length.x.top : NULL
$ axis.ticks.length.x.bottom: NULL
$ axis.ticks.length.y : NULL
$ axis.ticks.length.y.left : NULL
$ axis.ticks.length.y.right : NULL
$ axis.line : list()
..- attr(*, "class")= chr [1:2] "element_blank" "element"
$ axis.line.x : NULL
$ axis.line.x.top : NULL
$ axis.line.x.bottom : NULL
$ axis.line.y : NULL
$ axis.line.y.left : NULL
$ axis.line.y.right : NULL
$ legend.background :List of 5
..$ fill : NULL
..$ colour : logi NA
..$ linewidth : NULL
..$ linetype : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_rect" "element"
$ legend.margin : 'margin' num [1:4] 5.5points 5.5points 5.5points 5.5points
..- attr(*, "unit")= int 8
$ legend.spacing : 'simpleUnit' num 11points
..- attr(*, "unit")= int 8
$ legend.spacing.x : NULL
$ legend.spacing.y : NULL
$ legend.key :List of 5
..$ fill : chr "white"
..$ colour : logi NA
..$ linewidth : NULL
..$ linetype : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_rect" "element"
$ legend.key.size : 'simpleUnit' num 1.2lines
..- attr(*, "unit")= int 3
$ legend.key.height : NULL
$ legend.key.width : NULL
$ legend.text :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : 'rel' num 0.8
..$ hjust : NULL
..$ vjust : NULL
..$ angle : NULL
..$ lineheight : NULL
..$ margin : NULL
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ legend.text.align : NULL
$ legend.title :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : num 0
..$ vjust : NULL
..$ angle : NULL
..$ lineheight : NULL
..$ margin : NULL
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ legend.title.align : NULL
$ legend.position : chr "right"
$ legend.direction : NULL
$ legend.justification : chr "center"
$ legend.box : NULL
$ legend.box.just : NULL
$ legend.box.margin : 'margin' num [1:4] 0cm 0cm 0cm 0cm
..- attr(*, "unit")= int 1
$ legend.box.background : list()
..- attr(*, "class")= chr [1:2] "element_blank" "element"
$ legend.box.spacing : 'simpleUnit' num 11points
..- attr(*, "unit")= int 8
$ panel.background :List of 5
..$ fill : chr "white"
..$ colour : logi NA
..$ linewidth : NULL
..$ linetype : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_rect" "element"
$ panel.border :List of 5
..$ fill : logi NA
..$ colour : chr "grey70"
..$ linewidth : 'rel' num 1
..$ linetype : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_rect" "element"
$ panel.spacing : 'simpleUnit' num 5.5points
..- attr(*, "unit")= int 8
$ panel.spacing.x : NULL
$ panel.spacing.y : NULL
$ panel.grid :List of 6
..$ colour : chr "grey87"
..$ linewidth : NULL
..$ linetype : NULL
..$ lineend : NULL
..$ arrow : logi FALSE
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_line" "element"
$ panel.grid.major :List of 6
..$ colour : NULL
..$ linewidth : 'rel' num 0.5
..$ linetype : NULL
..$ lineend : NULL
..$ arrow : logi FALSE
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_line" "element"
$ panel.grid.minor :List of 6
..$ colour : NULL
..$ linewidth : 'rel' num 0.25
..$ linetype : NULL
..$ lineend : NULL
..$ arrow : logi FALSE
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_line" "element"
$ panel.grid.major.x : NULL
$ panel.grid.major.y : NULL
$ panel.grid.minor.x : NULL
$ panel.grid.minor.y : NULL
$ panel.ontop : logi FALSE
$ plot.background :List of 5
..$ fill : NULL
..$ colour : chr "white"
..$ linewidth : NULL
..$ linetype : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_rect" "element"
$ plot.title :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : 'rel' num 1.2
..$ hjust : num 0
..$ vjust : num 1
..$ angle : NULL
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 0points 0points 5.5points 0points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ plot.title.position : chr "panel"
$ plot.subtitle :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : num 0
..$ vjust : num 1
..$ angle : NULL
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 0points 0points 5.5points 0points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ plot.caption :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : 'rel' num 0.8
..$ hjust : num 1
..$ vjust : num 1
..$ angle : NULL
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 5.5points 0points 0points 0points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ plot.caption.position : chr "panel"
$ plot.tag :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : 'rel' num 1.2
..$ hjust : num 0.5
..$ vjust : num 0.5
..$ angle : NULL
..$ lineheight : NULL
..$ margin : NULL
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ plot.tag.position : chr "topleft"
$ plot.margin : 'margin' num [1:4] 5.5points 5.5points 5.5points 5.5points
..- attr(*, "unit")= int 8
$ strip.background :List of 5
..$ fill : chr "grey70"
..$ colour : logi NA
..$ linewidth : NULL
..$ linetype : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_rect" "element"
$ strip.background.x : NULL
$ strip.background.y : NULL
$ strip.clip : chr "inherit"
$ strip.placement : chr "inside"
$ strip.text :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : chr "white"
..$ size : 'rel' num 0.8
..$ hjust : NULL
..$ vjust : NULL
..$ angle : NULL
..$ lineheight : NULL
..$ margin : 'margin' num [1:4] 4.4points 4.4points 4.4points 4.4points
.. ..- attr(*, "unit")= int 8
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ strip.text.x : NULL
$ strip.text.x.bottom : NULL
$ strip.text.x.top : NULL
$ strip.text.y :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : NULL
..$ vjust : NULL
..$ angle : num -90
..$ lineheight : NULL
..$ margin : NULL
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ strip.text.y.left :List of 11
..$ family : NULL
..$ face : NULL
..$ colour : NULL
..$ size : NULL
..$ hjust : NULL
..$ vjust : NULL
..$ angle : num 90
..$ lineheight : NULL
..$ margin : NULL
..$ debug : NULL
..$ inherit.blank: logi TRUE
..- attr(*, "class")= chr [1:2] "element_text" "element"
$ strip.text.y.right : NULL
$ strip.switch.pad.grid : 'simpleUnit' num 2.75points
..- attr(*, "unit")= int 8
$ strip.switch.pad.wrap : 'simpleUnit' num 2.75points
..- attr(*, "unit")= int 8
- attr(*, "class")= chr [1:2] "theme" "gg"
- attr(*, "complete")= logi TRUE
- attr(*, "validate")= logi TRUE
Plot anchor_age:
patients_tble %>% ggplot() +
geom_histogram(aes(x = anchor_age, fill = gender),
binwidth = 5,
color = "black") +
labs(
x = "Age",
y = "Count",
title = "Age distribution vs. Gender"
) +
guides(fill = guide_legend(title = "Gender")) +
theme_light()There are slightly more female patients than male patients in the
patient.csv.gzdata set. The total number of female patients is slightly over150,000, while the total number of male patients is slightly below150,000.The age distribution is right-skewed, with most patients being younger than 30 years old. The number of patients shows a decreasing trend as age increases when age is greater than 55 years old. In general, the proportion of 2 genders in each age group is similar, some times female is slightly more than male.
Q5. Lab results
labevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/labevents/) contains all laboratory measurements for patients. The first 10 lines are
zcat < ~/mimic/hosp/labevents.csv.gz | headd_labitems.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/d_labitems/) is the dictionary of lab measurements.
zcat < ~/mimic/hosp/d_labitems.csv.gz | headWe are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of labevents.csv.gz that only containing these items for the patients in icustays_tble. Further restrict to the last available measurement (by storetime) before the ICU stay. The final labevents_tble should have one row per ICU stay and columns for each lab measurement.
Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make labevents_pq folder available at the current working directory hw3, for example, by a symbolic link.
Answer:
# chunk 33
labevents_tble <- arrow::open_dataset("labevents_pq", format = "parquet") %>%
filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931) &
subject_id %in% icustays_tble$subject_id) %>%
collect()
labevents_tble# A tibble: 13,473,870 × 16
labevent_id subject_id hadm_id specimen_id itemid order_provider_id
<int> <int> <int> <int> <int> <chr>
1 10 10000032 NA 52958335 50882 P28Z0X
2 14 10000032 NA 52958335 50902 P28Z0X
3 19 10000032 NA 52958335 50912 P28Z0X
4 33 10000032 NA 52958335 50971 P28Z0X
5 34 10000032 NA 52958335 50983 P28Z0X
6 60 10000032 NA 73913913 50931 P28Z0X
7 63 10000032 NA 81886159 51221 P28Z0X
8 75 10000032 NA 81886159 51301 P28Z0X
9 78 10000032 NA 31644093 51221 <NA>
10 89 10000032 NA 31644093 51301 <NA>
# ℹ 13,473,860 more rows
# ℹ 10 more variables: charttime <dttm>, storetime <dttm>, value <chr>,
# valuenum <dbl>, valueuom <chr>, ref_range_lower <dbl>,
# ref_range_upper <dbl>, flag <chr>, priority <chr>, comments <chr>
# Step 1: Join the tables on subject_id
# Optimized: combined into step 2 to save RAM
#joined_tble <- icustays_tble %>%
# left_join(labevents_tble, by = "subject_id")
#joined_tble
# Step 2: Filter rows where lab event storetime is before ICU stay intime
pre_icu_labevents <- icustays_tble %>% # Step 1: Join the tables on subject_id
left_join(labevents_tble, by = "subject_id") %>%
filter(storetime < intime)Warning in left_join(., labevents_tble, by = "subject_id"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 845 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
pre_icu_labevents# A tibble: 14,674,449 × 23
subject_id hadm_id.x stay_id first_careunit last_careunit intime
<dbl> <dbl> <dbl> <chr> <chr> <dttm>
1 10000032 29079034 3.96e7 Medical Inten… Medical Inte… 2180-07-23 14:00:00
2 10000032 29079034 3.96e7 Medical Inten… Medical Inte… 2180-07-23 14:00:00
3 10000032 29079034 3.96e7 Medical Inten… Medical Inte… 2180-07-23 14:00:00
4 10000032 29079034 3.96e7 Medical Inten… Medical Inte… 2180-07-23 14:00:00
5 10000032 29079034 3.96e7 Medical Inten… Medical Inte… 2180-07-23 14:00:00
6 10000032 29079034 3.96e7 Medical Inten… Medical Inte… 2180-07-23 14:00:00
7 10000032 29079034 3.96e7 Medical Inten… Medical Inte… 2180-07-23 14:00:00
8 10000032 29079034 3.96e7 Medical Inten… Medical Inte… 2180-07-23 14:00:00
9 10000032 29079034 3.96e7 Medical Inten… Medical Inte… 2180-07-23 14:00:00
10 10000032 29079034 3.96e7 Medical Inten… Medical Inte… 2180-07-23 14:00:00
# ℹ 14,674,439 more rows
# ℹ 17 more variables: outtime <dttm>, los <dbl>, labevent_id <int>,
# hadm_id.y <int>, specimen_id <int>, itemid <int>, order_provider_id <chr>,
# charttime <dttm>, storetime <dttm>, value <chr>, valuenum <dbl>,
# valueuom <chr>, ref_range_lower <dbl>, ref_range_upper <dbl>, flag <chr>,
# priority <chr>, comments <chr>
# Step 3: Pick out the storetime that is maximized but not greater than intime
last_labevents_pre_icu <- pre_icu_labevents %>%
group_by(subject_id, stay_id, itemid) %>%
summarise(storetime = max(storetime)) `summarise()` has grouped output by 'subject_id', 'stay_id'. You can override
using the `.groups` argument.
# Get maximum storetime for each group
# Group: per itemid for per stay for per patient
last_labevents_pre_icu# A tibble: 525,174 × 4
# Groups: subject_id, stay_id [68,467]
subject_id stay_id itemid storetime
<dbl> <dbl> <int> <dttm>
1 10000032 39553978 50882 2180-07-23 00:59:00
2 10000032 39553978 50902 2180-07-23 00:59:00
3 10000032 39553978 50912 2180-07-23 00:59:00
4 10000032 39553978 50931 2180-07-23 00:59:00
5 10000032 39553978 50971 2180-07-23 01:14:00
6 10000032 39553978 50983 2180-07-23 00:59:00
7 10000032 39553978 51221 2180-07-23 00:00:00
8 10000032 39553978 51301 2180-07-23 00:00:00
9 10000980 39765666 50882 2189-06-27 00:56:00
10 10000980 39765666 50902 2189-06-27 00:56:00
# ℹ 525,164 more rows
# Step 4: Join the last lab events before ICU stay with the pre_icu_labevents
final_labevents_pre_icu <- last_labevents_pre_icu %>%
left_join(pre_icu_labevents,
by = c("subject_id", "stay_id", "storetime", "itemid"))
final_labevents_pre_icu# A tibble: 525,243 × 23
# Groups: subject_id, stay_id [68,467]
subject_id stay_id itemid storetime hadm_id.x first_careunit
<dbl> <dbl> <int> <dttm> <dbl> <chr>
1 10000032 39553978 50882 2180-07-23 00:59:00 29079034 Medical Intensive C…
2 10000032 39553978 50902 2180-07-23 00:59:00 29079034 Medical Intensive C…
3 10000032 39553978 50912 2180-07-23 00:59:00 29079034 Medical Intensive C…
4 10000032 39553978 50931 2180-07-23 00:59:00 29079034 Medical Intensive C…
5 10000032 39553978 50971 2180-07-23 01:14:00 29079034 Medical Intensive C…
6 10000032 39553978 50983 2180-07-23 00:59:00 29079034 Medical Intensive C…
7 10000032 39553978 51221 2180-07-23 00:00:00 29079034 Medical Intensive C…
8 10000032 39553978 51301 2180-07-23 00:00:00 29079034 Medical Intensive C…
9 10000980 39765666 50882 2189-06-27 00:56:00 26913865 Medical Intensive C…
10 10000980 39765666 50902 2189-06-27 00:56:00 26913865 Medical Intensive C…
# ℹ 525,233 more rows
# ℹ 17 more variables: last_careunit <chr>, intime <dttm>, outtime <dttm>,
# los <dbl>, labevent_id <int>, hadm_id.y <int>, specimen_id <int>,
# order_provider_id <chr>, charttime <dttm>, value <chr>, valuenum <dbl>,
# valueuom <chr>, ref_range_lower <dbl>, ref_range_upper <dbl>, flag <chr>,
# priority <chr>, comments <chr>
# Step 5: Pivot wider
wider_lab <- final_labevents_pre_icu %>%
select(subject_id, stay_id, itemid, valuenum) %>%
arrange(subject_id, stay_id) %>%
pivot_wider(
names_from = itemid,
values_from = valuenum,
values_fn = list(valuenum = last) # Use last to get the last measurement
)
wider_lab# A tibble: 68,467 × 10
# Groups: subject_id, stay_id [68,467]
subject_id stay_id `50882` `50902` `50912` `50931` `50971` `50983` `51221`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10000032 39553978 25 95 0.7 102 6.7 126 41.1
2 10000980 39765666 21 109 2.3 89 3.9 144 27.3
3 10001217 34592300 30 104 0.5 87 4.1 142 37.4
4 10001217 37067082 22 108 0.6 112 4.2 142 38.1
5 10001725 31205490 NA 98 NA NA 4.1 139 NA
6 10001884 37510196 30 88 1.1 141 4.5 130 39.7
7 10002013 39060235 24 102 0.9 288 3.5 137 34.9
8 10002155 31090461 23 98 2.8 117 4.9 135 25.5
9 10002155 32358465 26 85 1.4 133 5.7 120 22.4
10 10002155 33685454 24 105 1.1 138 4.6 139 39.7
# ℹ 68,457 more rows
# ℹ 1 more variable: `51301` <dbl>
# Step 6: Rename the original itemid to the corresponding lab measurement
wider_lab <- wider_lab %>%
rename(
creatinine = `50912`,
potassium = `50971`,
sodium = `50983`,
chloride = `50902`,
bicarbonate = `50882`,
hematocrit = `51221`,
wbc = `51301`,
glucose = `50931`
)
wider_lab# A tibble: 68,467 × 10
# Groups: subject_id, stay_id [68,467]
subject_id stay_id bicarbonate chloride creatinine glucose potassium sodium
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10000032 39553978 25 95 0.7 102 6.7 126
2 10000980 39765666 21 109 2.3 89 3.9 144
3 10001217 34592300 30 104 0.5 87 4.1 142
4 10001217 37067082 22 108 0.6 112 4.2 142
5 10001725 31205490 NA 98 NA NA 4.1 139
6 10001884 37510196 30 88 1.1 141 4.5 130
7 10002013 39060235 24 102 0.9 288 3.5 137
8 10002155 31090461 23 98 2.8 117 4.9 135
9 10002155 32358465 26 85 1.4 133 5.7 120
10 10002155 33685454 24 105 1.1 138 4.6 139
# ℹ 68,457 more rows
# ℹ 2 more variables: hematocrit <dbl>, wbc <dbl>
labevents_tble <- wider_lab %>% ungroup()
labevents_tble# A tibble: 68,467 × 10
subject_id stay_id bicarbonate chloride creatinine glucose potassium sodium
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10000032 39553978 25 95 0.7 102 6.7 126
2 10000980 39765666 21 109 2.3 89 3.9 144
3 10001217 34592300 30 104 0.5 87 4.1 142
4 10001217 37067082 22 108 0.6 112 4.2 142
5 10001725 31205490 NA 98 NA NA 4.1 139
6 10001884 37510196 30 88 1.1 141 4.5 130
7 10002013 39060235 24 102 0.9 288 3.5 137
8 10002155 31090461 23 98 2.8 117 4.9 135
9 10002155 32358465 26 85 1.4 133 5.7 120
10 10002155 33685454 24 105 1.1 138 4.6 139
# ℹ 68,457 more rows
# ℹ 2 more variables: hematocrit <dbl>, wbc <dbl>
# Free RAM usage for subsequent rendering
rm(pre_icu_labevents,
last_labevents_pre_icu,
final_labevents_pre_icu,
wider_lab)Q6. Vitals from charted events
chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are
zcat < ~/mimic/icu/chartevents.csv.gz | headd_items.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.
zcat < ~/mimic/icu/d_items.csv.gz | headWe are interested in the vitals for ICU patients: heart rate (220045), systolic non-invasive blood pressure (220179), diastolic non-invasive blood pressure (220180), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble. Further restrict to the first vital measurement within the ICU stay. The final chartevents_tble should have one row per ICU stay and columns for each vital measurement.
Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make chartevents_pq folder available at the current working directory, for example, by a symbolic link.
Answer:
# chunk 37
chartevents_tble <- arrow::open_dataset("chartevents_pq",
format = "parquet") %>%
filter(itemid %in% c(220045, 220180, 220179, 223761, 220210) &
subject_id %in% icustays_tble$subject_id) %>%
collect()
chartevents_tble# A tibble: 22,504,119 × 11
subject_id hadm_id stay_id caregiver_id charttime
<int> <int> <int> <int> <dttm>
1 10001884 26184834 37510196 46969 2131-01-12 12:01:00
2 10001884 26184834 37510196 46969 2131-01-12 12:01:00
3 10001884 26184834 37510196 46969 2131-01-12 13:00:00
4 10001884 26184834 37510196 46969 2131-01-12 13:00:00
5 10001884 26184834 37510196 46969 2131-01-12 13:01:00
6 10001884 26184834 37510196 46969 2131-01-12 13:01:00
7 10001884 26184834 37510196 46969 2131-01-12 14:00:00
8 10001884 26184834 37510196 46969 2131-01-12 14:00:00
9 10001884 26184834 37510196 46969 2131-01-12 14:01:00
10 10001884 26184834 37510196 46969 2131-01-12 14:01:00
# ℹ 22,504,109 more rows
# ℹ 6 more variables: storetime <dttm>, itemid <int>, value <chr>,
# valuenum <dbl>, valueuom <chr>, warning <int>
chartevents_tble_backup <- chartevents_tble# Step 1: Pick out the minimum of charttime per ICU stay
first_chartevents_icu <- chartevents_tble %>%
left_join(icustays_tble, by = c("subject_id", "stay_id")) %>%
filter(charttime >= intime & charttime <= outtime) %>%
group_by(subject_id, stay_id, itemid) %>%
summarise(charttime = min(charttime)) %>%
ungroup()`summarise()` has grouped output by 'subject_id', 'stay_id'. You can override
using the `.groups` argument.
first_chartevents_icu# A tibble: 362,465 × 4
subject_id stay_id itemid charttime
<dbl> <dbl> <int> <dttm>
1 10000032 39553978 220045 2180-07-23 07:12:00
2 10000032 39553978 220179 2180-07-23 07:11:00
3 10000032 39553978 220180 2180-07-23 07:11:00
4 10000032 39553978 220210 2180-07-23 07:12:00
5 10000032 39553978 223761 2180-07-23 07:00:00
6 10000980 39765666 220045 2189-06-27 01:56:00
7 10000980 39765666 220179 2189-06-27 01:55:00
8 10000980 39765666 220180 2189-06-27 01:55:00
9 10000980 39765666 220210 2189-06-27 01:54:00
10 10000980 39765666 223761 2189-06-27 02:07:00
# ℹ 362,455 more rows
# Step 2: Join the first_chartevents_icu with the chartevents_tble
final_chartevents_icu <- first_chartevents_icu %>%
left_join(chartevents_tble,
by = c("subject_id", "stay_id", "charttime", "itemid"))
final_chartevents_icu# A tibble: 362,465 × 11
subject_id stay_id itemid charttime hadm_id caregiver_id
<dbl> <dbl> <int> <dttm> <int> <int>
1 10000032 39553978 220045 2180-07-23 07:12:00 29079034 88981
2 10000032 39553978 220179 2180-07-23 07:11:00 29079034 88981
3 10000032 39553978 220180 2180-07-23 07:11:00 29079034 88981
4 10000032 39553978 220210 2180-07-23 07:12:00 29079034 88981
5 10000032 39553978 223761 2180-07-23 07:00:00 29079034 88981
6 10000980 39765666 220045 2189-06-27 01:56:00 26913865 36518
7 10000980 39765666 220179 2189-06-27 01:55:00 26913865 36518
8 10000980 39765666 220180 2189-06-27 01:55:00 26913865 36518
9 10000980 39765666 220210 2189-06-27 01:54:00 26913865 36518
10 10000980 39765666 223761 2189-06-27 02:07:00 26913865 36518
# ℹ 362,455 more rows
# ℹ 5 more variables: storetime <dttm>, value <chr>, valuenum <dbl>,
# valueuom <chr>, warning <int>
# Step 3: Pivot wider
wider_chart <- final_chartevents_icu %>%
select(subject_id, stay_id, itemid, valuenum) %>%
arrange(subject_id, stay_id) %>%
pivot_wider(
names_from = itemid,
values_from = valuenum,
values_fn = list(valuenum = first) # Use first to get the first measurement
)
wider_chart# A tibble: 73,164 × 7
subject_id stay_id `220045` `220179` `220180` `220210` `223761`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10000032 39553978 91 84 48 24 98.7
2 10000980 39765666 77 150 77 23 98
3 10001217 34592300 96 167 95 11 97.6
4 10001217 37067082 86 151 90 18 98.5
5 10001725 31205490 55 73 56 19 97.7
6 10001884 37510196 38 180 12 10 98.1
7 10002013 39060235 80 104 70 14 97.2
8 10002155 31090461 94 118 51 18 96.9
9 10002155 32358465 98 109 65 23 97.7
10 10002155 33685454 68 126 61 18 95.9
# ℹ 73,154 more rows
# Step 4: Rename the original itemid to the corresponding vital measurement
wider_chart <- wider_chart %>%
rename(
heart_rate = `220045`,
non_invasive_blood_pressure_systolic = `220179`,
non_invasive_blood_pressure_diastolic = `220180`,
temperature_fahrenheit = `223761`,
respiratory_rate = `220210`
)
wider_chart# A tibble: 73,164 × 7
subject_id stay_id heart_rate non_invasive_blood_pr…¹ non_invasive_blood_p…²
<dbl> <dbl> <dbl> <dbl> <dbl>
1 10000032 39553978 91 84 48
2 10000980 39765666 77 150 77
3 10001217 34592300 96 167 95
4 10001217 37067082 86 151 90
5 10001725 31205490 55 73 56
6 10001884 37510196 38 180 12
7 10002013 39060235 80 104 70
8 10002155 31090461 94 118 51
9 10002155 32358465 98 109 65
10 10002155 33685454 68 126 61
# ℹ 73,154 more rows
# ℹ abbreviated names: ¹non_invasive_blood_pressure_systolic,
# ²non_invasive_blood_pressure_diastolic
# ℹ 2 more variables: respiratory_rate <dbl>, temperature_fahrenheit <dbl>
chartevents_tble <- wider_chart %>% ungroup()
# Free RAM usage for subsequent rendering
rm(first_chartevents_icu, final_chartevents_icu, wider_chart)Q7. Putting things together
Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are all ICU stays of adults (age at intime >= 18) and columns contain at least following variables
- all variables in
icustays_tble
- all variables in
admissions_tble
- all variables in
patients_tble - the last lab measurements before the ICU stay in
labevents_tble - the first vital measurements during the ICU stay in
chartevents_tble
The final mimic_icu_cohort should have one row per ICU stay and columns for each variable.
Answer:
# Step 1: Join the icustays_tble with the patients_tble
mimic_icu_cohort <- icustays_tble %>%
left_join(patients_tble, by = "subject_id")
# Step 2: Join the mimic_icu_cohort with the admissions_tble and
# filter out the patients who are less than 18 years old
mimic_icu_cohort <- mimic_icu_cohort %>%
mutate(age_intime = year(intime) - anchor_year + anchor_age)
mimic_icu_cohort <- mimic_icu_cohort %>%
left_join(admissions_tble, by = c("subject_id", "hadm_id")) %>%
filter(age_intime >= 18)
# Step 3: Join the mimic_icu_cohort with labevents_tble
mimic_icu_cohort <- mimic_icu_cohort %>%
left_join(labevents_tble, by = c("subject_id", "stay_id"))
# Step 4: Join the mimic_icu_cohort with chartevents_tble
mimic_icu_cohort <- mimic_icu_cohort %>%
left_join(chartevents_tble, by = c("subject_id", "stay_id"))
mimic_icu_cohort# A tibble: 73,181 × 41
subject_id hadm_id stay_id first_careunit last_careunit intime
<dbl> <dbl> <dbl> <chr> <chr> <dttm>
1 10000032 29079034 39553978 Medical Inten… Medical Inte… 2180-07-23 14:00:00
2 10000980 26913865 39765666 Medical Inten… Medical Inte… 2189-06-27 08:42:00
3 10001217 24597018 37067082 Surgical Inte… Surgical Int… 2157-11-20 19:18:02
4 10001217 27703517 34592300 Surgical Inte… Surgical Int… 2157-12-19 15:42:24
5 10001725 25563031 31205490 Medical/Surgi… Medical/Surg… 2110-04-11 15:52:22
6 10001884 26184834 37510196 Medical Inten… Medical Inte… 2131-01-11 04:20:05
7 10002013 23581541 39060235 Cardiac Vascu… Cardiac Vasc… 2160-05-18 10:00:53
8 10002155 20345487 32358465 Medical Inten… Medical Inte… 2131-03-09 21:33:00
9 10002155 23822395 33685454 Coronary Care… Coronary Car… 2129-08-04 12:45:00
10 10002155 28994087 31090461 Medical/Surgi… Medical/Surg… 2130-09-24 00:50:00
# ℹ 73,171 more rows
# ℹ 35 more variables: outtime <dttm>, los <dbl>, gender <chr>,
# anchor_age <dbl>, anchor_year <dbl>, anchor_year_group <chr>, dod <date>,
# age_intime <dbl>, admittime <dttm>, dischtime <dttm>, deathtime <dttm>,
# admission_type <chr>, admit_provider_id <chr>, admission_location <chr>,
# discharge_location <chr>, insurance <chr>, language <chr>,
# marital_status <chr>, race <chr>, edregtime <dttm>, edouttime <dttm>, …
Q8. Exploratory data analysis (EDA)
Summarize the following information about the ICU stay cohort mimic_icu_cohort using appropriate numerics or graphs:
Length of ICU stay
losvs demographic variables (race, insurance, marital_status, gender, age at intime)Length of ICU stay
losvs the last available lab measurements before ICU stayLength of ICU stay
losvs the average vital measurements within the first hour of ICU stayLength of ICU stay
losvs first ICU unit
Answer:
8.1 Length of ICU stay los vs. demographic variables
Age at intime
# Scatter plot of los vs. age
los_age <- mimic_icu_cohort %>%
group_by(age_intime) %>%
summarise(meanlos = mean(los))
meanlos_max <- los_age %>% slice_max(order_by = meanlos, n = 1)
meanlos_min <- los_age %>% slice_min(order_by = meanlos, n = 1)
los_age %>%
ggplot(aes(x = age_intime, y = meanlos)) +
geom_line() +
geom_smooth(color = "orange") +
labs(
x = "Age at intime",
y = "Averaged Length of ICU stay (days)",
title = "Length of ICU stay vs. age"
) +
geom_point(data = meanlos_max,
aes(x = age_intime, y = meanlos),
color = "red",
size = 5) +
geom_point(data = meanlos_min,
aes(x = age_intime, y = meanlos),
color = "blue",
size = 5) +
annotate("text",
x = meanlos_max$age_intime,
y = meanlos_max$meanlos + 0.2,
label = str_c("Max: ", round(meanlos_max$meanlos, 2), " days, age ",
meanlos_max$age_intime),
color = "red") +
annotate("text",
x = meanlos_min$age_intime - 10,
y = meanlos_min$meanlos - 0.2,
label = str_c("Min: ", round(meanlos_min$meanlos, 2), " days, age ",
meanlos_min$age_intime),
color = "blue") +
theme_light()`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
- From the line plot, we can see that, as the
age_intimeincreases, the averaged length of ICU stay increases at first until somewhere around 70 years old, then it starts to decrease. The maximum averaged length of ICU stay is4.61days at age27, and the minimum is1.88days at age102. This may be due to the fact that the older the patient, the more likely they are to have more severe diseases, which may lead to an earlier clinical death that shortens the length of ICU stay.
Race
los_race <- mimic_icu_cohort %>%
select(race, los) %>%
group_by(race) %>%
summarise(mean_los = mean(los)) %>%
arrange(desc(mean_los)) %>%
print()# A tibble: 33 × 2
race mean_los
<chr> <dbl>
1 AMERICAN INDIAN/ALASKA NATIVE 4.46
2 PORTUGUESE 4.26
3 ASIAN - KOREAN 4.23
4 UNABLE TO OBTAIN 4.22
5 UNKNOWN 4.20
6 ASIAN - ASIAN INDIAN 4.20
7 HISPANIC/LATINO - DOMINICAN 3.86
8 BLACK/AFRICAN 3.82
9 WHITE - OTHER EUROPEAN 3.66
10 BLACK/CARIBBEAN ISLAND 3.61
# ℹ 23 more rows
los_race %>% head(1)# A tibble: 1 × 2
race mean_los
<chr> <dbl>
1 AMERICAN INDIAN/ALASKA NATIVE 4.46
los_race %>% tail(1)# A tibble: 1 × 2
race mean_los
<chr> <dbl>
1 HISPANIC/LATINO - MEXICAN 2.65
- From the numeric summary, we can see that the mean length of ICU stay is the longest for race
AMERICAN INDIAN/ALASKA NATIVEwhich is4.455923and the shortest for raceHISPANIC/LATINO - MEXICANwhich is2.648043.
Insurance
mimic_icu_cohort %>%
ggplot(aes(x = insurance, y = los)) +
geom_boxplot(fill = "lightskyblue",
color = "darkblue") +
labs(
title = "Length of ICU stay vs. insurance"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme_light()mimic_icu_cohort %>%
group_by(insurance) %>%
summarise(mean_los = mean(los)) %>%
arrange(desc(mean_los)) %>%
print()# A tibble: 3 × 2
insurance mean_los
<chr> <dbl>
1 Medicare 3.48
2 Medicaid 3.46
3 Other 3.42
- From the result, we can see that the averaged length of ICU stay is the longest for
Medicareand the shortest forOther. But the difference between each group is not very noticeable.
Marital status
mimic_icu_cohort %>%
ggplot(aes(x = marital_status, y = los)) +
geom_boxplot(fill = "lightskyblue",
color = "darkblue") +
labs(
title = "Length of ICU stay vs. marital status"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme_light()mimic_icu_cohort %>%
group_by(marital_status) %>%
summarise(mean_los = mean(los)) %>%
arrange(desc(mean_los)) %>%
print()# A tibble: 5 × 2
marital_status mean_los
<chr> <dbl>
1 <NA> 4.28
2 MARRIED 3.46
3 SINGLE 3.40
4 DIVORCED 3.39
5 WIDOWED 3.13
- Again, the difference of averaged
losbetween each group is not very noticeable for those who stated theirmarital_status. One thing to note is that the averagedlosfor groupNAis the longest which is about4.28days.
Gender
mimic_icu_cohort %>%
ggplot(aes(x = gender, y = los)) +
geom_boxplot(fill = "lightskyblue",
color = "darkblue") +
labs(
title = "Length of ICU stay vs. gender"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme_light()mimic_icu_cohort %>%
group_by(gender) %>%
summarise(mean_los = mean(los)) %>%
arrange(desc(mean_los)) %>%
print()# A tibble: 2 × 2
gender mean_los
<chr> <dbl>
1 M 3.54
2 F 3.34
- From the result, we can see that the averaged
losis about3.54days inMalegroup and3.34days inFemalegroup. AlthoughMale’s averagedlosis slightly longer thanFemale’s, the difference is not very noticeable.
8.2 Length of ICU stay los vs. the last available lab measurements before ICU stay
mimic_icu_cohort %>%
select(los, creatinine, potassium, sodium, chloride, bicarbonate, hematocrit,
wbc, glucose) %>%
pivot_longer(c("creatinine", "potassium", "sodium", "chloride", "bicarbonate",
"hematocrit", "wbc", "glucose"),
names_to = "measurement",
values_to = "value") %>%
ggplot(aes(x = value, y = los)) +
geom_point(aes(x = value, y = los, color = measurement),
alpha = 0.3) +
geom_smooth() +
facet_grid(measurement ~ .,
scales = "free")`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 60686 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 60686 rows containing missing values (`geom_point()`).
From the plot we can see that there is no clear pattern between
losand the most of the last available lab measurements before ICU stay.However, from the plot we can glimpse that there is a subtle trend that a higher
bicaarbonatelevel and a higherhematocritlevel are associated with a longerlos, while a highersodiumlevel and a higherwbclevel are associated with a shorterlos.
8.3 Length of ICU stay los vs. the average vital measurements within the first hour of ICU stay
mimic_icu_cohort %>%
select(los, heart_rate, non_invasive_blood_pressure_systolic,
non_invasive_blood_pressure_diastolic, temperature_fahrenheit,
respiratory_rate) %>%
pivot_longer(c("heart_rate",
"non_invasive_blood_pressure_systolic",
"non_invasive_blood_pressure_diastolic",
"temperature_fahrenheit",
"respiratory_rate"),
names_to = "measurement",
values_to = "value") %>%
filter(value < quantile(value, 0.99, na.rm = TRUE) &
value > quantile(value, 0.01, na.rm = TRUE)) %>% # Remove outliers
ggplot(aes(x = value, y = los)) +
geom_point(aes(x = value, y = los, color = measurement),
alpha = 0.5) +
geom_smooth() +
facet_grid(measurement ~ .)`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
- From the plot we can see that there is no clear pattern between
losand the vital measurements within the first hour of ICU stay.
8.4 Length of ICU stay los vs. first ICU unit
# Graphical summary
mimic_icu_cohort %>%
group_by(first_careunit) %>%
filter(los < quantile(los, 0.99, na.rm = TRUE)) %>% # Remove outliers
ggplot(aes(x = first_careunit, y = los)) +
geom_boxplot(fill = "lightskyblue",
color = "darkblue") +
labs(
title = "Length of ICU stay vs. first ICU unit"
) +
theme_light() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))# Numeric summary
mimic_icu_cohort %>%
group_by(first_careunit) %>%
summarise(mean_los = mean(los)) %>%
arrange(desc(mean_los)) %>%
print()# A tibble: 9 × 2
first_careunit mean_los
<chr> <dbl>
1 Neuro Surgical Intensive Care Unit (Neuro SICU) 6.30
2 Surgical Intensive Care Unit (SICU) 3.84
3 Trauma SICU (TSICU) 3.83
4 Neuro Intermediate 3.42
5 Cardiac Vascular Intensive Care Unit (CVICU) 3.29
6 Medical Intensive Care Unit (MICU) 3.26
7 Coronary Care Unit (CCU) 3.19
8 Medical/Surgical Intensive Care Unit (MICU/SICU) 3.08
9 Neuro Stepdown 2.59
From the result we can see that
Neuro SICUhas the longest averagedloswhich is about6.30days, andNeuro Stepdownhas the shortest averagedloswhich is about2.59days.Neuro SICU’s averagedlosis significantly longer than the rest of the units, from which we can infer that patients inNeuro SICUmay have more severe diseases that require a longer ICU stay.